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CN Abstract 



Predicting user affinity to items is an important problem in applications 
like content optimization, computational advertising, and many more. While 
bilinear random effect models (matrix factorization) provide state-of-the-art 



Q performance when minimizing RMSE through a Gaussian response model on 

'~~' explicit ratings data, applying it to imbalanced binary response data presents 

^__{ additional challenges that we carefully study in this paper. Data in many 

^ applications usually consist of users' implicit response that are often binary - 

^^ clicking an item or not; the goal is to predict click rates (i.e., probabilities), 

^— I which is often combined with other measures to calculate utilities to rank 

^O items at runtime of the recommender systems. Because of the implicit nature, 

CO such data are usually much larger than explicit rating data and often have an 

imbalanced distribution with a small fraction of click events, making accurate 
click rate prediction difficult. In this paper, we address two problems. First, 
we show previous techniques to estimate bilinear random effect (BIRE) models 
with binary data are less accurate compared to our new approach based on 
adaptive rejection sampling, especially for imbalanced response. Second, we 
C^ develop a parallel bilinear random effect model fitting framework using Map- 

Reduce paradigm that scales to massive datasets. Our parallel algorithm is 
based on a "divide and conquer" strategy coupled with an ensemble approach. 
Through experiments on the benchmark MovieLens IM data, a small Yahoo! 
Front Page Today Module data set, and a large Yahoo! Front Page Today 
Module data set that contains 8M users and IB binary observations, we show 
that careful handling of binary response as well as identifiability issues are 
needed to achieve good performance for click rate prediction, and that the 



proposed adaptive rejection sampler and the partitioning as well as ensemble 
techniques significantly improve model performance. 

1 Introduction 

Personalized item recommendation is an important task in many web applications, 



such as content optimization (Agarwal et a/., 2008), computational advertising (Broder 



2008), and others. Such systems recommend a set of items like article links, ads, 
product links, etc for each user visit; users respond by clicking and/or engaging in 
other activities post-click. Personalizing such recommendations based on the user's 
demographic information and browsing history, typically leads to better user engage- 
ment and profit for organizations. Accurate prediction of the probability of a user 
clicking an item, is an important input to facilitate such personalization. 

For a user i and an item j, let Pij be the probability of user i clicking item j. If 
we let Vj denote the utility of clicking item j (e.g., the ad revenue associated with 
a click on item j), the system may rank items based on some function of rj and pij 
to maximize the utility. In computational advertising for instance, ranking is based 
on the expected revenue VjPij. Click probabilities are usually estimated through a 
statistical model trained on past click data. Intuitively, we can think of the click 
data as a binary matrix Y, such that entry yij = 1 if user i clicked item j, and 
Hij = if user i viewed but did not click on j. We note that 1^ is a highly incomplete 
matrix with many unobserved entries since each user usually views a small number 
of items. The goal is to predict pij for unobserved {i,j) pairs. We also note that 
in many web applications the click-rates are small giving rise to highly imbalanced 
binary response data. 

1.1 Background and Literature 

The problem of personalized item recommendation described above is closely re- 
lated to a rich literature on recommender systems and collaborative filtering. An 



overview can be seen in Adomavicius and Tuzhilin (2005). Recommender systems 



are algorithms that model user-item interactions to facilitate the process of personal- 
ized item recommendations. There are two types of approaches that are widely used 
in recommender systems: content-basedsippToa.ch.es andcoUaborative filtering. The 
content-based approaches use only user and item covariates to model the user-item 
interaction. Collaborative filtering approaches model user-item interactions by user's 
past response alone, no covariates are used. However, in real recommender systems 
we often observe both "warm-start" and "cold-start" scenarios: "Warm-start" means 



we have past observations from this user/item so that both the past responses and 
covariates can be used in modehng. "Cold-start" scenario represents when a new 
user /item comes to the system; hence we do not have any past responses but may 
still have the covariates. To handle both scenarios a hybrid approach that combines 
content-based and collaborative filtering is often used in recommender systems. 
Nearest-neighbor methods are widely used in collaborative filtering (e.g 



Sarwar 



et al. (2001), Wang et al. ( 2006|)). They are very popular in large-scale commercial 
systems, such as Nag (2008), Linden et al. (2003). The basic idea of the nearest- 



neighbor methods is to compute item-item similarity or user-user similarity from 
Pearson correlation, cosine similarity or Jaccard similarity of the responses of a 
pair of users/items. Then for each unobserved user-item pair, the prediction is 
simply a weighted average of the set of nearest neighbor's responses, and the weights 



come from the similarity measures. More recently, Agarwal et al. (2011) proposed a 



Bayesian hierarchical modeling approach to model the item-item similarity in a more 
principled way. 



Since the Netflix challenge (Bennett and Lanning, 2007), the SVD-style matrix 



factorization methods have been well known to provide state-of-the-art performance 
for recommender problems. In this paper, we shall refer to the class of matrix factor- 
ization models as bilinear random effects (BIRE) models. A theoretical perspective 



of this problem was first provided in Srebro et al. (2005). Bell et al. (2007); Bennett 



and Lanning (2007) have successfully used this strategy in collaborative filtering ap- 



plications. Salakhutdinov and Mnih ( 2008a|[b ) formulate a probabilistic framework 
using a hierarchical random-effects model where the user and item factors are mul- 
tivariate random-effects (factor vectors) that were regularized through zero-mean 
multivariate Gaussian priors. These papers used bilinear random effects as a tool 
to solve pure collaborative filtering problems; they do not work well in applications 
with significant cold-starts which is commonplace in several applications. 

In light of the "cold-start" problem, a desirable approach would be to have a 
model that provides predictive accuracy as BIRE for warm-start scenarios but fall- 
backs on a feature-based regression model in cold-start scenarios. Incorporating 
both warm-start and cold-start scenarios simultaneously in collaborative filtering 
is a well studied problem with a rich literature. Several methods that combine 
content and collaborative filtering have been studied. For instance, Balabanovic 



and Shoham (1997) present a recommender system that computes user similarities 



based on content-based profiles. In Claypool et al. (1999), collaborative filtering and 



content-based filtering are combined linearly with weights adjusted based on abso- 



lute errors of the models. In Melville et al. (2002), content based models are used to 



fill up the rating matrix followed by recommendation based on similarity (memory) 



based methods (iBreese et al.\ 119981). In IGood et al.\ (|1999l); iPark et al.\ (12006^, /i/ 



terbots are used to improve cold-start recommendations. Schein ei al. (2003) extend 



the aspect model to combine the item content with user ratings under a single prob- 
abilistic framework. In Agarwal and Chen (2009); Stern et al. (2009), a principled 



solution is proposed by using linear model regression priors on the user and item 
random effects through features. This is generalized in Zhang et al. (2011) so that 



the regression priors can be non-linear functions which further improves the model 
predictive accuracy. These solutions based on incorporating features into random 
effects itself are superior than the classical methods of dealing with warm-start and 
cold-start scenarios simultaneously. 



1.2 Statistical Challenges 

The basic idea of the bilinear random effect (BIRE) models is to approximate the 
response i/ij from user i and item j by an inner product of the user random effect Ui 
and the item random effect Vj. For Gaussian responses, the loss function is usually 
RMSE, i.e. min^..(yjj — u[vj)^ over observed (z,j) pairs. Due to a preponderance 
of missing entries in Y, the user/item random effects have to be regularized further 
to avoid over-fitting the training data. This is usually done by constraining latent 
profiles through the L2 norms or equivalently assuming zero-mean Gaussian priors on 
the user/item random effects. To handle cold-start scenarios, instead of zero-mean 
Gaussian priors, covariates-based regression priors can be used on the user /item ran- 
dom effects, and both random effects and the regression parameters are estimated 



simultaneously through a Monte Carlo EM (MCEM) algorithm (Booth and Robert 



1999). Agarwal and Chen (2009); Zhang et al. ( 2011[ ) show that this modeling frame- 
work gives state-of-the-art performance, especially for Gaussian user-item interaction 
responses. However, in some real recommender systems we often observe the follow- 
ing two major challenges: 

• Imbalanced binary response: Many web applications depend on implicit 
user feedback that are in the form of events like clicks. Further, these events 
are usually rare; this gives rise to imbalanced binary response data. Accurate 
estimation of probabilities with imbalanced binary response is known to be a 



difficult problem (Owen, 2007) even in the case of ordinary logistic regression; 



performing such estimation for elaborate BIRE models introduces additional 
challenges that have not been carefully studied before. 

Large scale data: As every display of an item to a user generates an ob- 
servation, data collected from these applications is massive for large systems 



such as major websites. In fact, the entire data often does not fit into memory 
and resides in large distributed data clusters. Scalable model fitting using dis- 



tributed computing paradigm like Map- Reduce (Dean and Ghemawat 2008) 
is an attractive option. However, the EM algorithms used to fit such EIRE 
models require sequential processing of data and are not directly amenable to 
computing in a Map-Reduce paradigm. Therefore, scalable model fitting for 
such massive datasets is still a challenge. 

We address both of the challenges mentioned above in this paper. First, although a 



variational approximation method has been proposed in Agarwal and Chen (2009) to 
perform approximate sampling of the user/item random effects for binary response 
data in the E-step of the Monte Carlo EM model fitting procedure, we show such an 
approximation does not give optimal prediction accuracy, especially for imbalanced 
binary response. By using an adaptive rejection sampling (ARS) technique to sample 
the factors exactly in the E-step, we can significantly improve prediction accuracy. 
We also find model identifiability issues when dealing with imbalanced binary re- 
sponse that hurts model accuracy; we show that this can be handled effectively in 
our framework by enforcing a few additional constraints on the random effects. To 
our surprise, we find the variational approximation deteriorates as the response gets 
rare and it happens because the random effect estimates shrink more compared to 
the exact ARS sampler. 

To handle model fitting for massive datasets, we develop a parallel BIRE model 
fitting framework based on Map-Reduce that fits accurate models in a scalable way. 
Our method is based on two key ideas - (a) creating several random partitions of the 
data and fitting separate models to each in parallel, and (b) an ensemble approach 
to refine the random effect estimates. The ensemble is created by using several runs 
of random partitioning of the data and averaging over the random effect estimates 
from each run. This combination of divide and conquer coupled with ensembles 
provides a simple yet effective procedure. Due to multi-modal nature of the posterior 
distribution, careful initialization that synchronizes factor estimates across partitions 
is important. The partitioning method employed also has an impact on performance; 
we provide a detailed study of these issues also. 

To summarize, we make the following contributions. We provide a careful study 
of fitting regression based bilinear random effect (BIRE) models to binary response 
data that are commonplace in many web applications. We show that when modeling 
rare response, previously proposed methods can be improved by exact sampling of 
factors through an adaptive rejection sampling procedure. We provide a modeling 
strategy that scales to massive datasets in a Map- Reduce framework. Our method is 
based on a divide and conquer strategy coupled with an ensemble approach. We show 



that exact sampling of random effects and carefully handling identifiability issues can 
make a significant difference in model performance. We compare our method with 
various baselines on benchmark data and illustrate impressive gains on a content 
optimization problem on the Today module of Yahoo! front page. 

2 Personalized Item Recommendation on Yahoo! 
Front Page Today Module 

The Yahoo! front page f www. yahoo, com) is a major web portal that attracts hun- 
dreds of millions of visitors every day. Figure [T] shows a snapshot of the Yahoo! front 
page. The Today module is one of the most obvious modules on the web page. It 
displays article links on four positions labeled as Fl through F4. The article link 
at the Fl position is displayed in a large and prime area in the module by default, 
while article links at F2 to F4 are displayed in the smaller footer area. A mouse 
hovering on a non-Fl article link will bring the article link to the prime area. A 
click on an article link in the prime area will then lead the user to the actual article 
page. Since there are usually multiple links corresponding to one article, including 
title, related images and videos, we shall refer the entire bundle of the links as one 
article and treat a click on any of the links as a click on the article. The personalized 
recommender system of the Today module is a combination of editorial oversight and 
statistical modeling. Statistical models are used to predict the probability of a user 
clicking an article; hence using the models to rank articles and show the best ones 
to the users gives optimal click through rates (CTR, the number of clicks divided by 
the number of views of the article) and improves user satisfaction and engagement 
for the front page. However, it has been well known that simply applying statistical 
modeling on a large and unscreened article inventory to optimize CTR is not opti- 
mal, sometimes even dangerous. For example, articles with salacious titles often have 
extremely high CTR, but in fact those are inappropriate for such a major web portal 
and it will cause Yahoo! to lose reputation. Therefore, in reality trained human 
editors at Yahoo! manually create and update the article inventory that contains 
30-40 articles at any given time, and statistical CTR prediction models such as the 
ones described in this paper, pick the best 4 articles to show on position F1-F4 from 
the inventory based on the user's covariates and previous browsing history. Also, 
please note that the lifetime of each article in the inventory is usually quite short, 
ranging from several hours to 1 day. 



"iXHoC 




Figure 1: A snapshot of the Yahoo! front page and its Today module. 



3 Regression-based Bilinear Random Effects Model 

In this section, we describe a probabihstic bihnear random effects (BIRE) model that 
leverages covariates to handle the cold-start problem and has been shown to provide 



state-of-the-art performance on a number of relatively small datasets (Agarwal and 



Chen 2009 Zhang et a/. 2011). We only describe the model with logistic link 



function for binary response, and refer the readers to Zhang et al. (2011) for the 



Gaussian model for numeric response and Poisson model for count data. 

Notation:. Let Hij denote whether user i clicks item j. Since we always use i to 
denote a user and j to denote an item, by slight abuse of notations, we let x,, Xj 
and Xij denote covariate vectors of user i, item j and pair (i, j). For example, the 
user covariate vector Xi may include age, gender and behavioral covariates. The item 
covariate vector Xj may include content categories, keywords, named entities, etc. 
The covariate vector Xij contains observation-specific covariates that are not entirely 
attributable to either user or item, e.g., time-of-day of the observation, position of 
the item on the displayed web page. 



Model: Our objective is to model the unobserved probability Pij that user i would 
click item j. Specifically, we assume a Bernoulli model using the logistic link function: 



Hij ~ Bernoulli (p. 



I] J 



Let Sij 



log T-^-^^ denote the log odds. We model Sj,- by 



-Pij 



Sij = f{xij) + ai + Pj + u[vj, 



(1) 



(2) 



where f{xij) is a regression function based on covariate vector Xij; ai and Ui are 
latent random effects (factors) representing the user bias and the r-dimensional latent 
profile of user z, respectively; and (5j and Vj are latent random effects (factors) 
representing the item popularity and the r-dimensional latent profile of item j. 

Flexible Regression Priors: Because the above model is usually over-parameterized 
with a large number of latent factors, it is important to regularize the factors to pre- 
vent over-fitting. A common practice is to shrink the factors toward zero. However, 
it fails to handle the cold-start problem because the predicted factor values of new 
users or items will all be zero. A better approach is to shrink factors to values pre- 



dicted based on features (Agarwal and Chen, 2009 Zhang et al., 2011). Specifically, 



we put the following priors on Oj, /5j 

Ui ~ N{g{x.i),a 
13 j ^ N{h{xj),apj, 



, Ui and Vj-. 



an 
2\ 



Ui 



«i~ 



Ar(G(x,),a^/), 



(3) 



where g and h are any choices of regression functions that return scalars, and G 
and H are regression functions that return r-dimensional vectors. These regression 
functions can be linear as in Agarwal and Chen (2009) or non-linear (e.g., decision 



tree, forest, etc.) as in Zhang et al. (2011) 



To better understand the usefulness of regression priors, take Ui for example. If 
user 2 is a new user, then Ui is predicted by G'(xi), where the regression function G 
is learned based on users who interacted with some items in the training data. Let 
G{xi) = {Gi{xi), ...,Gr{xi)). One example of G is to use a regression tree Gk for 
each latent dimension k to predict the value of the k-th dimension of a user's latent 
profile based on his/her covariate vector. If covariates are predictive, we would be 
able to make accurate click rate prediction for new users. 



4 Model Fitting for Binary Data 

In this section, we describe the model fitting procedure based on the Monte Carlo 
Expectation Maximization (MCEM) algorithm for datasets that can fit in a single 



machine. This is the building block for large data scenario stored in distributed 
clusters as described in Section|5] We first describe the general fitting procedure using 
the MCEM algorithm (Booth and Robert 1999) in Section 4.1 and then introduce 



two ways to handle binary response in the E-Step: the variational approximation 



method (VAR) in Section 4.2 and the adaptive rejection sampling method (ARS) 



in Section |4.3[ Finally, we briefly describe the M-Step in Section 4.4 it is the same 
as that in Zhang et al. (2011) but repeated here for comprehensiveness. 



4.1 The MCEM Algorithm 

Let = (/, g, h, G, H, a^, a^, a^, a^) be the set of prior parameters (also referred to 
as hyper-parameters). Let A = {aj, f3j, Ui, Vj}\/ij be the set of latent random effects 
(also referred to as factors). Let y denote the set of observed binary response. For 
M users and N items, the complete data log-likelihood is given by 

\ogL{@; A,y) = logPr[7/, A|©] = constant 

- J2ij Vij log(l + exp(-/(a;y) - a^ - f3j - u'^Vj)) 

- Eij(l - Vtj) log(l + exp(/(xij) + a, + (3^ + u'^Vj)) 



2al 



' E.K-^(^.))'-floga^ 



(4) 



-4E,(/3.- 


- h{xj)f - 


floga^ 


2llT.^ Ui 


-G{x,W 


-f^loga^ 


2I1 E, 1 ^. 


-H{x,)\' 


-f loga.2 



To apply the standard EM algorithm (Dempster et al, 1977), we can treat A as 



missing values and find the optimal estimate for that maximizes the marginal 
likelihood 

FT[y\&]=jLi&;A,y)dA. (5) 

The EM algorithm iterates between an E-step and a M-step until convergence. Let 
0^*) denote the current estimated value of at the beginning of the t-th iteration. 

• E-step: We take expectation of the complete data log likelihood with respect 
to the posterior distribution of the latent random effects A conditional on 
observed data y and the current estimate of 0; i.e., compute 

q,{&) = E^[\ogLi&;A,y)\&^'\y] (6) 

as a function of 0, where the expectation is taken over the posterior distribu- 
tion of p(A I 0*^*\ y) and G)*^*-* is treated as a set of constants. The output of 
the E-step consists of a set of sufficient statistics to be used in the M-step. 
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M-step: We maximize the expected complete data log-likelihood from the 
E-step to obtain updated values of 0; i.e., find 



©(*+!) = argmax qt{@). 



(7) 



Note that in the E-Step, the posterior p(A | 0*^*'', y) is not available in closed form. 
Thus, we compute Monte Carlo means based on Gibbs samples following the MCEM 



algorithm (Booth and Hobert 1999 Agarwal and Chen 2009 Zhang et ai, 2011). 



According to Salakhutdinov and Mnih (2008a); Agarwal and Chen (2009), this ap 



proach provides better predictive accuracy and avoids over-fitting while it remains 
scalable compared to other choices such as the iterative conditional mode (ICM) 
algorithm. 

Before we describe the E-Step, we first provide the formula for qt{&)- Let 6 = 
E[6\&^*'^^\ y] and V[5] = Var[5|©'^*+^\ t/], where 6 can be one of Oj, /3j, Ui and Vj. 
Then, we have 

qt{@) = E^[\ogL{&;A,y) \ @^^\y] = constant 

- Eij yijE[\og{l + exp{-f{xij) -a,- (5j - Kvj))] 

- E.,(l - y^,)E[\og{l + exp(/(x,,) + a, + (3, + u[v,))] 

-^^Y.,{{h-Kx,)f + V{(5,])-^\ogal 
-^^Y.^{\\u.-G{x,W + iT{V[u,]))-^f\ogal 



(8) 



f^log^.^. 



It is easy to see that the sufficient statistics for maximizing gt(©) are dj, /3j, iii, Vj 
for all I and j, as well as E,^N, Ej^Wj], J2MV[Ui\) and E, tr(V^[i;,]). This 
set of quantities is computed based on L Gibbs samples and is the output of the 
E-step. We note that the first two terms are difficult to expand and we will use 
plug-in estimates of ««, (3j, Ui and Vj to determine a near optimal solution for / in 
the M-step. 

4.2 Variational Method in E-Step 

Since E/iJ\\.ogL{&] A, y) \ ©*^*)] is not available in closed form, we compute the Monte- 



Carlo expectation based on L samples generated by a Gibbs sampler (Gelfand, 2000). 



The Gibbs sampler repeats the following procedure L times. In the following text. 
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we use {6 \ Rest), where 6 can be one of a,, f3j, Ui, and Vj, to denote the conditional 
distribution of 6 given all the other latent random effects and the observations y. 
Let Xj denote the set of users who rated item j, and J'i denote the set of items rated 
by user i. 



The variational approximation is based on Jaakkola and Jordan (2000) and was 



proposed in Agarwal and Chen (2009) to factorize binary matrices. We note that 



there is a typo in the variational approximation formula in Agarwal and Chen (2009 ) 



The basic idea is to transform binary response values into Gaussian response values 
before each EM iteration and then just use the E-Step and M-Step of the Gaussian 
model. 



Let $,ij be a parameter associated with each observed tjij. We can set all ^s 



u 



initially. 

• Before each E-step, create pseudo Gaussian response for each binary observa- 



tion i/ij G {0, 1}. The pseudo Gaussian response is 



2?;i; 



-1 



a. 



ij 2A(^,,) 



where A(^) = ^ tanh (|). 



'^ 4A(€,,) 



with variance 



Run the E-step using Gaussian pseudo observations {rij,afj). Details will be 
provided later. 



Run the M-step in Section 4.4 



m 



r 



• After the M-step, for each observation y.y, set ^ij = 

Now we describe the details of the E-step given the pseudo Gaussian observations 
afj). Repeat the following steps L times to draw L samples of A. 



«J 5 " ij 



Draw ai from Gaussian posterior p(Q;i|Rest) for each user i. 



Let o, 



V 



Var[ai\R,est] 



ij J\^ij) Pj '' 



(9) 



^[ailRest] = Far[ai I Rest] (^ + Y.jej., ^)- 

Draw Pj for each item j (similar to above). 

Draw Ui from Gaussian posterior {ui \ Rest) for each user i. 



Let o. 



^] 



fi^i 



ai - Pj, 

Var[u,\Rest] = {^J + EjeJ.'^r'- 
E[ui\Rest] = Var[ui\Rest]{^G{xi) + ^j^j^ 



(10) 



A2Jil 



OiiVi 
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• Draw Vj for each item j (similar to above). 

4.3 Adaptive Rejection Sampling in E-Step 

Although for binary data and logistic link function the conditional posterior p(aj|Rest), 
p{(3j\Rest), p('Uj|Rest) and p{vj\Rest) are not in closed form, precise and efficient 
sampling from the posterior can still be achieved through adaptive rejection sam- 
pling (ARS) ([Gilks , 1992). ARS is an efficient method to draw samples from an 



arbitrary univariate density provided it is log-concave. In our E-Step, we can draw 
a sample from the joint posterior distribution of A by drawing one number at a 
time sequentially from the univariate posterior distribution of each individual ran- 
dom effect given all the others using Gibbs sampling. To construct such a Gibbs 
sampler, we note that the univariate conditional posterior distributions p(-|Rest) are 
all log-concave; hence ARS can be applied. 

In general, rejection sampling (RS) is a popular method used to sample from a 
univariate distribution. Suppose we want to draw a sample from a non-standard 
distribution with density p{x). If one can find another density e{x) that is easier to 
sample from and approximates p{x) well and has tails heavier than p{x), then e{x) 
can be used to do rejection sampling. The key is to find a constant M such that 
p{x) < Me{x) for all points x such that p{x) > 0. For example, the blue curve in 
Figure[2]is Me{x) and the black solid curve is p{x). The algorithm then is simple: We 
repeat the following steps until we obtain a valid sample. First we draw a number x* 



X iiCli VVlLli piUUCXUllllj^ 

we reject it 



from e{x). Then with probability ]^j3J*) i we accept x* as a valid sample; otherwise. 



Notice that ^/'3j*) is always between and 1. This algorithm can be shown to 
provide a sample from p{x), and the acceptance probability is 1/M. Finding an M 
that is small often involves knowing the mode oi p{x); it is also important to find a 
good matching density e(x) in practice. ARS addresses both the issues. It finds a 
good matching density e(x) that is composed of piecewise exponentials; i.e., loge(a;) 
is piecewise linear like the blue curve in Figure |2} ARS does not need to know the 
mode of p{x), and the only requirement is the log-concavity of p{x), which is true 
for our problem. The piecewise exponentials are constructed by creating an upper 
envelope of the target log density. Further, the procedure is adaptive and uses the 
rejected points to further refine the envelope which reduces the rejection probability 
for future samples. 



We use the derivative-free ARS process from Gilks (1992) which can be briefiy 
described as follows: Suppose we want to obtain a sample x* from a log-concave target 
density function p{x). We start from at least 3 initial points such that at least one 
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Figure 2: Illustration of upper and lower bounds of an arbitrary (log) density func- 
tion. 

point lies on each side of the mode oip{x) (this is ensured by looking at the derivative 
of the density, which does not require actual mode computation). A lower bound 
lower{x) of \ogp{x) is constructed from the chords joining the evaluated points oip{x) 
with the vertical lines at the extreme points. For example, the dotted piecewise linear 
curve in Figure |2] is lower{x), while the solid black curve is logp(x). An upper bound 
upper{x) is also constructed by extending the chords to their intersection points. 
For example, the blue piecewise linear curve in Figure [2] is upper{x). The envelope 
function e(x) (upper bound) and the squeezing function s{x) (lower bound) are 
created by exponentiating the piece-wise linear upper and lower bounds of logp(x); 
i.e., e{x) = exp{upper{x)) and s{x) = exp{lower{x)) . Let ei(x) be the corresponding 
density function derived from e(x); i.e., ei(x) = j3^h^- The sampling produce works 
as follows: Repeat the following steps until we obtain a valid sample. 

• Draw a number x* from ei{x) and another number z ~ Unif(0, 1), indepen- 
dently. 

• li z < 4^, accept x* as a valid sample. 

• li z < ffey, accept x* as a valid sample; otherwise, reject x*. 

• If X* is rejected, update e(x) and s{x) by constructing new chords using x*. 

This goes on iteratively until one sample is accepted. Note that using the squeez- 
ing function as the acceptance criteria implies partial information from the original 
density p{x); Testing x* based on the squeezing function first is to save computation 
since the squeezing function is readily available from the constructed envelope and 
evaluation oi p{x*) is usually costly. 
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The ARS-based E-step works as follows: Repeat the following steps L times to 
draw L samples of A. 

• Sample Oj from p(aj|Rest) for each user i using ARS. The log of the target 
density is given by 

logp(ai|Rest) = constant 

- Ejej. Vij log(l + exp(-/(xij) - a^ - /3j - u[vj)) 

- Ejej.(l - yii) log(l + exp(/(xij) + a^ + Pj + u'^Vj)) 

- 2^(«i -9{xi)f- 



(11) 



Sample fij for each item j (similar to above). 

Sample Ui from p('Uj|Rest) for each user i. Since Ui is an r-dimensional vector, 
for each fc = 1, ■ ■ ■ ,r we sample Uik from p(Mjfc|Rest) using ARS. The log of 
the target density is given by 

logp(Mjfc|Rest) = constant 

- EjeJi Vij log(l + exp(-/(a;jj) - Oj - Pj - UikVjk 

(12) 

- EjGj- (1 ~ y«i ) log(l + exp(/(xjj) +ai + I3j + UifcWjfc 



+ ^l^kUil^Jl)) 



2;piV^ik ~ Gk{Xi)) . 



• Sample Vj for each item j (similar to above). 
Initial points for ARS: The rejection rate of ARS depends on the initial points 



and the target density function. To reduce the rejection rate, Gilks et al. (1995) 
suggest using the envelope function from the previous iteration of the Gibbs sampler 
to construct 5th, 50th and 95th percentiles as the 3 starting points. We adopted this 
approach in our sampling and observed roughly 60% reduction in rejection rates. 

Centering: We note that the model proposed in Section |3] is not identifiable. For 
example, if we let f{xij) = f{xij) — 6 and g{xi) = g{xi) + 6 where 6 can be any 
constant, the model using / and g is essentially the same as the one using / and 
g. To help identify the model parameters, we put constraints on the random effect 
values. Specifically, we require Ej '^j = 0; J2j Pj — 0; Ylii '"i = ^'^^ Ylij '"j — 0- 
These constraints induce dependencies among user random effects and item random 
effects. Instead of dealing with these dependencies in sampling, we simply enforce 
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these constraints after sampling by subtracting the sample mean; i.e., after sampling 
all the random effects, compute a = J2i ^i/^ ^"^^ set dj = d, — a for all i, and so 
on. Here, M is the number of users and dj is the posterior sample mean of a^. 

4.4 M-Step 

In the M-step, we find the parameter setting that maximizes the expectation 
computed in the E-step 

gi(e) = EA[logL(e;A,t/)|eW]. (13) 

It can be easily seen that (/, a^), ((7,cr^), {h,a^), (G, a^), and {H,a^) can be op- 
timized by separate regressions. Here we simply describe how to estimate (G, a^) 
since everything else is quite similar. Recall that Uik and V^[uifc] denote the posterior 
sample mean and variance of Uik computed based on the L Gibbs samples obtained 
in the E-step. It is easy to see that 

argmaxgi(©) = arg max ^ \\ui — G(xj)|p. (14) 

i 

Note that G is part of 0, and finding the optimal G is solving a least squares 
regression problem using Xi as covariates to predict multivariate response iii]. For 
univariate regression models, we consider G{xi) = {Gi{xi), ...,Gr{xi)), where each 
Gk{xi) returns a scalar. In this case, for each k, we find G^ by solving a regression 
problem that uses Xj as features to predict un.. Let RSS denote the total residual 
sum of squares. Then, o"^ = (^j^ V^i^ifc] + RSS)/(rM), which is obtained by setting 
the derivative of g* (0) with respect to a^ to zero. 

We note that obtaining the optimal / (i.e., argmaxj gi(0)) is actually difficult 
because of the expectation of the log of some combination of random effects. Thus, 
we use plug-in estimates; i.e., solve a logistic regression problem that uses Xij as 
features to predict yij with offset d, + f3j + u'iVj. 

5 Parallelized Model Fitting for Large Data 

In this section we consider fitting algorithms for large data sets that reside in dis- 
tributed clusters and cannot fit into memory of a single machine. For such scenarios, 
fitting algorithms described in Section [4| do not work. We provide a fitting strategy 



in a the Map- Reduce framework (Dean and Ghemawat, 2008). We first apply the 



"divide and conquer" approach to partition the data into small partitions, and then 
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run MCEM on each partition to obtain estimates of 0. The final estimate of are 
obtained by averaging over estimates of from all the partitions. Finally, given 
fixed, we do n ensemble runs, i.e. re-partition the data n times using different ran- 
dom seeds, and for each re-partitioning we only run E-Step jobs on all partitions and 
then average the results from them to obtain the final estimate of A. This algorithm 
is described in Algorithm [TJ 

Algorithm 1 Parallel EIRE Model Fitting 
Initialize and A. 

Partition data into m partitions using random seed sq. 
for each partition d. G {1, ...,m] running in parallel do 

Run MCEM algorithm for K number of iterations using VAR or ARS to obtain 

0^, the estimates of for each partition L 
end for 

m 

Let = ^ V 0^ 

i=\ 

for /c = 1 to n running in parallel do 

Partition data into m partitions using random seed s^. 
for each partition £ G {1, ..., m} running in parallel do 

Run E-Step-Only job given and obtain the posterior sample mean A^^ for 
all users and items in partition i,. 
end for 
end for 

For each user i, average over all A^^^ that contain user i to obtain dj and Uj. 
For each item j, average over all Afc£ that contain item j to obtain /3j and t)j. 



Partitioning the data: Extensive experiments conducted by us showed that model 
performance depends crucially on data partitioning strategy used in the Map- Reduce 
phase, especially when data is sparse. A naive way of randomly partitioning obser- 
vations may not give good predictive accuracy. For applications such as content 



optimization (Agarwal et al, 2008), the number of users are often much larger than 
the number of items. Also, the number of observations available per user is small for 
a large fraction of users; a typical item tends to have a relatively larger sample size. 
In such cases, we recommend partitioning the data by users, which guarantees that 
all data from a user belongs to the same partition, so that good user random effects 
can be obtained. Similarly, when the number of items is larger than the number of 
users, we recommend partitioning the data by items. An intuitive explanation of this 
can be gleaned by looking at the conditional variance of user random effect Ui using 
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variational approximation given as Var[Wj|Rest] = {^I + Yljej ~^) ^- Assuming 

item random effects are known for the moment (or estimated with high precision), 
if the user data is split into several partitions, the average information gain (inverse 
variance) from the partitioned data is the harmonic mean of information gain from 
individual partitions. The information gain from the non-partitioned data can be 
written as the arithmetic mean of the individual information gains. Since harmonic 
mean is less than arithmetic mean, the information loss in estimating the user random 
effects by partitioning is the difference in arithmetic and harmonic means. When the 
information in partitions becomes weak, this gap increases. Hence, with sparse user 
data, it is prudent to partition by users. 

Estimates of 0: We note the estimate obtained from each random partition is 
unbiased, fitting a model on each partition and then averaging the M-step parameters 
0£ for i = 1, ■ ■ ■ ,171 provide an estimate that is still unbiased and has lower variance 
due to lack of positive correlations among estimates. The correlations are absent due 
to the random partitioning. Before running the MCEM algorithm, the initial values 
of for all partitions are the same. In particular, we start with zero-mean priors; 
i.e., g{xi) = h{xj) = and G(xj) = H{xj) = 0. To improve parameter estimation, 
one may synchronize the parameters among partitions and run another round of 
MCEM iterations; i.e., one may re-partition the data and use the obtained as the 
initial values of to run another round of MCEM iterations for each partition to 
obtain a new estimate of 0. However, we observe in practice that iteratively running 
this process does not give significantly better predictive accuracy, but instead adds 
complexity and training time. 

Estimates of A: For each run in the ensemble, it is essential to use a different 
random seed for partitioning the data, so that the mix of users and items in partitions 
across different runs would be different. Given 0, for each run in the ensemble, 
we only need to run E-step once for each partition and obtain the final user and 
item random effects by taking the average. Again, the random partitioning ensures 
uncorrelated estimates from members of the ensemble and leads to variance reduction 
through averaging. 

More identifiability issues: After centering the model is in fact still non-identifiable 
because of two reasons: (a). Since u'-Vj = (— ■Uj)'(— t;^), switching signs of u and v 
(also the corresponding cold-start parameters) does not change the log-likelihood. 
(b). For any two random effect indices k and /, switching Uik with uu, Vjk with 
Vji for all users and items simultaneously also would not change the log-likelihood, 
given that the corresponding cold-start parameters are also switched. We have found 
empirically that both of the identifiability issues do not matter for small data sets, 
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especially single-machine runs. However, for large data sets such as the Yahoo! front- 
page data and G, H defined as linear regression function, we observe that for each 
partition after the MCEM step we obtain significantly different fitted values of G and 
if, so that after averaging over all the partitions the resulting coefficient matrices 
for G and H become almost zero. Hence the identifiability issue can become severe 
while fitting parallelized EIRE models for large data sets. 

Solution to the identifiability issues: For (a), we put constraints on the item 
random effects v so that it is always positive. This can be done through simply 
putting a sampling lower bound (i.e. always sample positive numbers) in the adap- 
tive rejection sampling. Note that after using this approach we do not need to do 
centering on v any more. For (b), we first let cr^ = 1 and change the prior of Ui 
from N{G{xi),a'^I) to N{G{xi),Ilu), where Hu is a diagonal variance matrix with 
diagonal values cr„i > au2 > • • ■ > o"ur- The model fitting is very similar; but after 
each M-step we re-sort all the random effects by the fitted o"„/c's for A; = 1, ■ ■ ■ , r to 
satisfy the constraint. 

6 Experiments 

We evaluate the proposed methods to address two main questions: (1) How do differ- 
ent techniques for handling binary response compare? (2) How do different methods 
perform in a real, large-scale web recommender system? For the first question, we 
compare variational approximation, adaptive rejection sampling and stochastic gra- 
dient descent on balanced and imbalanced binary datasets created from the public 
MovieLens IM dataset. For the second question, we first evaluate the predictive per- 
formance using a small balanced data set with heavy users of the Today module on 
the Yahoo! front page to allow comparison in the single-machine fitting scenario, and 
then provide complete end-to-end evaluation in terms of the click-lift metric through 



a recently proposed unbiased offline evaluation method (Li et al. (2011), which has 
been shown to be able to approximate the online performance) based on massive 
imbalanced binary response data collected from the Today module. 

Methods: We consider the following different models or fitting methods, all used 
with 10 factors per user/item throughout the experiments: 

• FEAT-ONLY is the covariate-interaction-only model which serves as our base- 
line. Specifically, the model is 

Sij = f{xij) + g{xi) + h{xj) + G{xi)'H{xj), 

where g, h, G and H are unknown regression functions, fitted by the standard 
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conjugate gradient descent method on each partition and averaging over esti- 
mates from all partitions to obtain estimates of g, h, G and H; no ensemble run 
is needed. 

MCEM-VAR is our regression-based BIRE model fitted by variational ap- 
proximation in the MCEM algorithm. 

MCEM-ARS is our regression-based BIRE model fitted by centered adaptive 
rejection sampling algorithm in each E-step of the MCEM algorithm. 
MCEM-ARSID is our regression-based BIRE model fitted by centered adap- 
tive rejection sampling algorithm in each E-step of the MCEM algorithm, incor- 
porating positive constraints on the item random effect (factor) v and ordered 
diagonal prior covariance matrix of u (see Section Is] for more details). 
SGD is a method that fits a similar BIRE model using stochastic gradient 



descent. We obtained the code from Charkrabarty et al. (2012). Specifically, 
the model is 

Sij = (oj + Ui + Ux,i)'{Pj + Vj + Vxj), 

where U and V are unknown coefficient matrices for cold-start to map the 
covariate vectors Xi and Xj into the r-dimensional latent space. For binary 
response with logistic link function, it minimizes the following loss function 

^ Vij log(l + exp(-Sij)) + ^(1 - Vij) log(l + exp(sij)) 

ij ij 

+XuY,\\uif + A, J^||t^,f + XuWUf + XvWVf, 

i J 

where A„, A^,, \u and Ay are tuning parameters, and \\U\\ and ||V|| are Frobe- 
nius norms. Since this code has not been parallelized, we only use it in ex- 
periments on small datasets. Trying different tuning parameter values can be 
computationally expensive. In the experiments, we set Xu = K = ^u = ^v = ^ 
with A varying from 0, 10~^, 10~^, 10~^ and 10~^. We also tuned the learning 
rate by trying 10-^ 10"^, 10-^ lO'^ and 10"^ 

In FEAT-ONLY , MCEM-VAR , MCEM-ARS and MCEM-ARSID, we use linear 

regression functions for g, h, G and H. 

6.1 MovieLens IM Data 

We first compare three techniques for fitting BIRE-style models with binary response 
(MCEM-VAR , MCEM-ARS and SGD) on the benchmark MovieLens IM dataset. 



Note that in Agarwal and Chen (2009), the regression-based BIRE model has been 
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AUG 


Method 


# Partitions 


Imbalanced Balanced 


SGD 


1 


0.8090 0.7413 


MCEM-VAR 


1 


0.8138 0.7576 


MCEM-ARS 


1 


0.8195 0.7563 





2 


0.7614 


0.7599 


MCEM-VAR 


5 


0.7191 


0.7538 




15 


0.6584 


0.7421 




2 


0.8194 


0.7622 


MCEM-ARS 


5 


0.7971 


0.7597 




15 


0.7775 


0.7493 



Table 1: AUC of different methods on the imbalanced and balanced MovieLens 
datasets (#partitions= 1 indicates single- machine runs) 

proved to be significantly better than various baseline models, such as zero-mean 



BIRE model and the Filterbot from Park et al. (2006) 



Data: The MovieLens IM data consists of IM ratings with scare from 1 to 5 provided 
by 6,040 users on set of 3,706 movies. We create training-test split based on the 
timestamps of the ratings; the first 75% of ratings serve as training data and the rest 
25% as test data. This split introduces many new users (i.e. cold-start) in test data. 
To study how different techniques handle binary response with different degree of 
sparsity of the positive response, we consider two different ways of creating binary 
response: (1) An imbalanced dataset is created by setting the response value to 1 
if and only if the original 5-point rating value is 1; otherwise it is set to 0. The 
percentage of positive response in this dataset is around 5%. (2) A balanced dataset 
is created by setting the response to 1 if the original rating is 1, 2, or 3; otherwise it 
is set to 0. The percentage of positive response in this dataset is around 44%. We 
report the predictive performance of SGD, MCEM-VAR and MCEM-ARS in terms 



of the Area Under the ROC Curve (AUC) for both datasets in Table 6.1 



Comparison between MCEM-ARS and MCEM-VAR : As can be seen from 



the Table 6J^, MCEM-ARS and MCEM-VAR have similar performance and both 
slightly outperform SGD when running on a single machine (i.e., ^partitions = 1). 
It is interesting to see that, when running on multiple machines with 2 to 15 parti- 
tions, MCEM-ARS and MCEM-VAR still have similar performance on the balanced 
dataset, while on the imbalanced dataset MCEM-VAR becomes much worse when 
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the number of partitions increases (causing more severe data sparsity). We note that 
the degradation of performance when the number of partitions increases is expected 
because, with more partitions, each partition would have less and sparser data, which 
leads to a less accurate model for the partition. 

Comparison with SGD: Since SGD is a popular fitting method for SVD-style 



matrix factorization (Koren et al., 2009), we also discuss how our sampling-based 
methods compare to SGD. To obtain good performance for SGD, one has to try a 
large number of different values of the tuning parameters and learning rates, while 
our methods does not need such tuning because all the hyper-parameters are ob- 
tained through the EM algorithm. Trying different tuning parameter values can be 
computationally expensive, and it is less efficient in exploring the parameter space 
compared to an EM algorithm. After our best-effort tuning using the test data, for 
imbalanced data, SGD achieves best performance 0.8090 with A = 10~^ and learn- 
ing rate = 10~^. For balanced data, SGD achieves best performance 0.7413 with 
A = 10~^ and learning rate = 10^^. Even tuning SGD on test data, the best AUG 
numbers of SGD on both balanced and imbalanced datasets are still slightly worse 
than the those of MCEM-VAR and MCEM-ARS (which did not touch test data 
before testing). 

6.2 Small Yahoo! Front Page Data 

We now evaluate different BIRE model fitting methods on a previously analyzed 



Yahoo! front page dataset (Agarwal and Chen, 2009), which allows comparison of 
these methods to prior work. 

Data: This dataset consists of 1.9M binary response values (click or non-click) 
obtained from about 30K heavy users interacting with 4,316 news articles published 
in the Today module on the Yahoo! front page. The observations were sorted by 
their timestamps and the first 75% of them are used as training data and the rest 
25% as test data. The set of user covariates include age, gender, geo-location and 
browsing behavior that is inferred based on users' network-wide activity (e.g. search, 
ad-clicks, page views, subscriptions etc.) Since the original set of user covariates is 



large, dimension reduction was done through principal component analysis (Agarwal 



and Chen, 2009), and finally we obtained around 100 numerical user covariates. Item 
covariates consist of 43 hand-labeled editorial categories. 

Recall that the Today module displays article links on four positions labeled as 
Fl through F4, where the Fl article resides in a large and prime area. Also, a hover 
on a non-Fl article link will bring the article link to the prime area. A click on 
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Method 


# Partitions 


Partition Method 


AUC 


FEAT-ONLY 


1 


- 


0.6781 


SGD 


1 


— 


0.7252 


MCEM-VAR 


1 


- 


0.7374 


MCEM-ARS 


1 


- 


0.7364 


MCEM-ARSID 


1 


- 


0.7283 




2 


User 


0.7280 


MCEM-ARS 


5 


User 


0.7227 




15 


User 


0.7178 




2 


User 


0.7294 




5 


User 


0.7172 


MCEM-ARSID 


15 


User 


0.7133 




15 


Event 


0.6924 




15 


Item 


0.6917 



Table 2: AUC of different methods on the small Yahoo! front page dataset 
(7^partitions= 1 indicates single-machine runs) 

an article link in the prime area will then lead the user to the actual article page. 
For this data set, clicks on article links in the prime area are interpreted as positive 
response, while displays of article links in the prime area (i.e. hover on a non-Fl 
article) without subsequent clicks are considered as negative response. Also note 
that user visits with no click on any position were ignored. Hence, in this data set, 
the percentage of positive response is close to 50% — it is a balanced data set. 

Single-machine results: We first discuss the AUC performance for FEAT-ONLY 
, MCEM-VAR , MCEM-ARS and MCEM-ARSID running on a single machine (i.e. 



1 partition), shown in Table |0} We observe that MCEM-VAR , MCEM-ARS , 
MCEM-ARSID and SGD all outperform FEAT-ONLY significantly. This is because 
these models allow warm-start user random effects (those users having data in the 
training period) to deviate from purely covariate-based predictions, in order to better 
fit the data. On the other hand, since the test data consists of many new users and 



new items, handling cold-start scenarios is still important. It has been shown in Agar- 



wal and Chen (2009) that, for this data set, MCEM-VAR significantly improves upon 
EIRE models that use zero mean priors for random effects, which is commonly ap- 
plied in many recommender system problems such as Netflix, and MCEM-VAR also 
significantly outperformed various other collaborative filtering algorithms. We note 
that the performance of MCEM-VAR , MCEM-ARS and MCEM-ARSID are all 
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close. This suggests that for balanced datasets, different fitting methods for logistic 
models are similar. We also note that MCEM-ARSID perform slightly worse than 
MCEM-ARS , because adding constraints on the item random effects v reduces the 
flexibility of MCEM-ARSID . We defer the discussion on when MCEM-ARSID can 



provide significant benefit to Section 6.3 



Comparison with SGD: Similar to what we see in Section 6.1 even with SGD 
tuned on test data, the best AUC 0.7252 (achieved by using A = 10~^ and learning 
rate = IQ-^) is still slightly worse than the AUC of MCEM-VAR , MCEM-ARS and 
MCEM-ARSID for single machine runs. 



Number of partitions: In Table 6.2 for MCEM-ARS and MCEM-ARSID (10 en- 



semble runs for both), as the number of partitions grows, we observe the expected 



degradation of performance, similar to what we observed in Section 6.1 However, 
even with 15 partitions on such a small data set MCEM-ARS and MCEM-ARSID 
(user-based partitioning) still significantly outperforms FEAT-ONLY . In general, 
increasing the number of partitions would increase computational efficiency, but usu- 
ally leads to worse performance. We have observed in our experiments that for large 
data sets the computation time of 2A^ partitions is roughly half of using A^ partitions. 
Therefore we use as few partitions as possible given our computational budget. 



Different partition methods: In Table 6.2 we also show the performance of our 



parallel algorithm MCEM-ARSID (10 ensemble runs) with different numbers of par- 
titions and various partition methods. As mentioned in Section |5j we note that 
partitioning the data by users is better than event-based or item-based partitioning 
in our application. The reason for this is that in our application, there are generally 
more users than items in the data; hence user partitions are less sparse. 

6.3 Large Yahoo! Front Page Data 

In this subsection, we show the performance of our parallel algorithms on a large 
Yahoo! front page dataset where single-machine fitting algorithms are not feasible. 
An unbiased evaluation method is used to estimate the expected dick-lifts if these 



algorithms were used in the production system (Li et al. , 2011 ). Agarwal et al. (2011 ) 
used the same click-lift metric to measure the model performances. 

Data: The training data was collected from the Today module on Yahoo! front 
page during June 2011, while the test events were collected during July 2011. The 
training data includes all page views by users with at least 10 clicks in the Today 
module, and consists of 8M users, ~4.3K items and 1 billion binary observations. To 
remove selection bias in evaluating our algorithms, the test data is collected from a 
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randomly chosen user population where, for each user visit, an article is selected at 
random from the content pool and displayed at the Fl position. We shall refer to 
this as random bucket, which consists of around 2.4M clicks with old users who were 
seen in the training period as well as new ones. 

Each user is associated with 124 behavior covariates that reflect various kinds of 
user activities on the entire Yahoo! network. Each item is associated with 43 editorial 
hand-labeled categories. A click on an Fl article link is a positive observation, while 
a view of an Fl article link without a subsequent click is a negative observation. The 
percentage of positive response here is much lower than that of the small dataset 
(4%-10%) — the increased sparsity and imbalance introduces additional challenges. 

Experimental setup: Because article lifetimes in the Today module are short (6-24 
hours), almost all items in the test period are new. To provide good performance for 
new items, one may frequently re-train BIRE models in the test period. However, 
since the amount of data is large, frequent re-training is not a feasible solution. Note 
that the set of users that come to Yahoo! are much less dynamic than items, hence a 
viable solution is to assume the user random effects (factors) from the training period 
is fixed and learn the item random effects in an online fashion. More precisely, let Xi 
denote the behavior covariate vectors of user i and w, denote the user random effect 
vector produced by a BIRE model that is learned in training period. For each item 
j at time t in t est period, we f i t an i ndividual online logistic regression (OLR) model 
as described in Agarwal et al. (2010) with log-odds x^f3jt + u^5jt, where the unknown 



parameters (/3jt, 8jt) are updated onhne after each test epoch as we collect more data 
on each item. The OLR models are initialized with a prior (/3jO) <5jo) ~ MVN{0, a^I). 
Notice that different BIRE fitting methods generate different UiS. The performance 
of a method is based on click-lift of the recommendations generated based on the 
OLR models using the itjS. 

Unbiased evaluation: The goal of this set of experiments is to maximize total 
number of clicks. The precision® 1 metric computed on the random bucket test set 
was shown to provide an unbiased measure of an algorithm's performance when it is 



actually implemented in production (Li et al, 2011). We provide a brief description 
of the evaluation metric below. 

For an epoch t (5-minute interval) in the test period, we do the following: 

1. Compute the predicted CTR of all articles in the pool for each event in epoch t 
under the model, based on the user covariates and latent random effects. The 
estimates can use all data before epoch t. 

2. For each click event at time t in the test data, we select the an article j* 
from the current pool with the highest predicted probability. If the article that 
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was actually clicked in the test data matches j*, we give the model a reward; 
otherwise, we ignore it. 

At the end, we compute click lift metric based on the total reward received from the 
model. Mathematically, for the test data from a random bucket and model M, the 
score S{M) can be defined as 

S{M) = y, l(item clicked = item selected by M). (15) 

visits with click 

It has been proved that S{M) is an unbiased metric compared to the real click 



lift seen in a production system (Li et ai, 2011). Because each article in the random 



bucket has an equal probability to be displayed to users, the number of matched 
view events for any model is expected to be the same. A better model to optimize 
CTR can match more click events. For large amounts of data as in our case, the 
variance of the click-lift metric for any model is very small; all differences reported 
in our experiments have small p-values and are statistically significant due to large 
sample size in the test data. 

Two baseline methods:. To show that random-effect-based user covariates (i.e. 
Ui) provide state-of-the-art performance to personalize content on Yahoo! front page, 
in this paper we implement two baseline methods of generating user covariates based 
on users' past interaction on the Today module: 

ITEM-PROFILE: Using training data we pick top 1000 items that have highest 
number of views. We construct 1000-dimensional binary user profiles to indicate 
whether in the training period this user has ever clicked on this item (1 is clicked 
and is non-clicked). For cold-start users that did not show up in training data, we 
simply let the binary profile vector to be all 0. 

CATEGORY-PROFILE: Since in this dataset each item has 43 binary covari- 
ates indicating content categories to which the item belongs, we build user-category 
preference profiles through the following approach: For user i and category k, denote 
the number of observed views as Vik and number of clicks as Cik- From the training 
data we can obtain the global per-category CTR, denoted as 7^. We then model 
Cik as Cik ~ Poisson^Vik'jkKk) , where Xik is the unknown user-category preference 
parameter. We assume Xik has a Gamma prior Gamma{a,a), hence the posterior of 
Xik becomes {Xik\vik,Cik) ~ Gamma{cik + a,Vik'jk + ci)- We use the log of the posterior 
mean, i.e. log( ^^''°"^° ) as the profile covariate value for user i on category k. Note 
that if we do not observe any data for user i and category k, the covariate value 
becomes 0. a is a tuning prior sample size parameter and can be obtained through 
cross-validation. By trying a =1, 5, 10, 15 and 20, we have found that for this data 
set a = 10 is the optimal value. 
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Method 


#Ensembled 


Overall 


Warm 


Cold 




Runs 




Start 


Start 


ITEM-PROFILE 


- 


3.0% 


14.1% 


-1.6% 


CATEGORY-PROFILE 


- 


6.0% 


20.0% 


0.3% 


MCEM-VAR 


10 


5.6% 


18.7% 


0.2% 


MCEM-ARS 


10 


7.4% 


26.8% 


-0.5% 


MCEM-ARSID 


1 


9.1% 


24.6% 


2.8% 


MCEM-ARSID 


10 


9.7% 


26.3% 


2.9% 



Table 3: The overall click lift over the user behavior covariate (BT) only model. 

Experimental Results: We evaluate all methods by reporting click-lift obtained 
through the unbiased evaluation method relative to an online logistic model that 
only uses behavioral (BT) covariates Xf, such a model does not incorporate users' 
past interaction with items — its performance on heavy users has large room for 



improvement. In Table 6.3, we summarize the overall lift, warm start lifts (users 
seen in the training set), and cold-start lifts (new users). All models produce lifts 
but the performance of MCEM-ARSID is the best for overall and cold-starts, and 
MCEM-ARS is the best for warm-starts. The reason that we see no lift for cold-start 
users on MCEM-ARS is because of the identifiability issues addressed in Section [5] 
Although imposing positive constraints on the item random effects leads MCEM- 
ARSID to have slightly inferior performance than MCEM-ARS for warm-start users, 
it solves the identifiability issues quite well and hence gives the best performance 
for the cold-start users. It is also interesting to see that MCEM-VAR is worse than 
CATEGORY-PROFILE, especially for warm-starts. We also observe that using the 
ensemble trick improves results as evident from comparing MCEM-ARSID with 1 
and 10 ensemble runs. 

To further investigate the performance of algorithms in different segments of 
warm-start users based on user activity levels on Today module in training period, 
we look at click-lifts by Today module activity levels in Figure [3j We split the 
users in the test data into several segments by their number of clicks in the training 
data. As expected, we see a near-monotone trend; users with more activity are 
personalized better by using their prior Today Module activity data. From Figure 
^we observe that MCEM-ARSID is uniformly better than CATEGORY-PROFILE 
and ITEM-PROFILE over all the user segments. Comparing performance of MCEM- 
ARSID, MCEM-ARS and MCEM-VAR we find the MCEM-VAR to be quite inferior 
to MCEM-ARS and MCEM-ARSID. 
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Figure 3: The click lift over the user behavior covariate (BT) only model for different 
user segments. The segments are created from the number of clicks in the training 
data. 

Potential issue with variational approximation: To investigate issues with 
MCEM-VAR with data sparsity, we examine the random effect estimates in Fig- 
ure [4] which shows the histograms of the fitted Ui and Vj after 30 EM iterations 
for MCEM-VAR and MCEM-ARS , both with 10 factors and 100 partitions. While 
the fitted user random effects for both MCEM-VAR and MCEM-ARS are in the 
similar scale, the item random effects produced by the variational approximation is 
approximately one order of magnitude smaller than those produced by MCEM-ARS 
. This phenomenon is in fact surprising and shows that MCEM-VAR tends to over- 
shrink the random effect estimates when fitting rare response. Similar phenomenon 
has been observed by iZhang and Agarwal (2008). This explains why the perfor- 



mance of MCEM-VAR deteriorates as the binary response gets rare. It seems that 
the variational approximation leads to too much shrinkage when working with rare 
response. 

6.4 Discussion of Results 

The experiments clearly show that regression-based BIRE models for binary response 
and using a divide and conquer stategy to scale the method in a Hadoop framework 
involves several subtle issues. For scenarios where we can fit the model using a single 
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u for MCEM-ARS 



V for MCEtl/l-ARS 



0.0 0.2 0.4 



-0.015 -0.005 0.005 0.015 



u for MCEM-VAR 



V for MCEM-VAR 



0.0 0.2 0.4 



-0.005 0.005 0.015 



Figure 4: The histogram of the fitted Ui and Vj after 30 iterations of the MCEM 
step for MCEM-VAR and MCEM-ARS , both with 10 factors and 400 partitions. 

machine, all methods work equally well on balanced binary response, the case widely 
studied in prior work. 

For highly imbalanced data, MCEM-VAR tends to deteriorate, we do not recom- 
mend its use in such scenarios. SGD works well provided that the learning rates and 
regularization parameters are tuned carefully, hence we do not recommend its use 
unless such tuning is undertaken seriously. Even after tuning, it is inferior to MCEM 
methods so we recommend using MCEM if possible. For single machine MCEM, 
imposing positivity constraints in MCEM-ARSID hurts performance slightly since it 
adds additional constraints. Therefore, we do not recommend it, instead we recom- 
mend fitting MCEM-ARS . 

The story is totally different when fitting map-reduce with divide and conquer. 
Since the BIRE models are multi-modal, each partition may converge to a very 
different regression estimate so that simple average of the regression coefficients leads 
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to poor performance. Here, we highly recominend making all the efforts to impose 
identifiability through MCEM-ARSID and synchronizing the initializations. We also 
recommend using the ensemble trick since it only uses the E-step and does not add 
too much to the computations. We discourage the use of MCEM-VAR since it breaks 
quite spectacularly with high sparsity. 

7 Conclusion 

In this paper, we introduced the adaptive rejection sampling (ARS) to our probabilis- 
tic regression-based bilinear random effects (EIRE) modeling framework to handle 
data sets with binary response in a better way. We note that data with binary 
response is common in web applications such as content optimization and compu- 
tational advertising. We also extended our EIRE model fitting methods to handle 
large data sets using Map-Reduce. Ey extensive experiments on benchmark datasets 
and the Yahoo! FrontPage Today Module data sets, we show that our model and 
fitting algorithms are stable and can significantly outperform variational approxima- 



tion proposed by Agarwal and Chen (2009); Zhang et al. (2011) and several other 



baselines. We also notice that carefully handling idenfiability issues have crucial 
impact on the EIRE model performance while handling large-scale data sets using 
Map- Reduce. 
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